Prediction of essential proteins in prokaryotes by incorporating various physico-chemical features into the general form of Chou's pseudo amino acid composition

Protein Pept Lett. 2013 Jul 1;20(7):781-95. doi: 10.2174/0929866511320070008.

Abstract

Predicting the essential proteins of a pathogenic organism is key to identifying potential drug targets, because inhibiting these proteins would be fatal to the pathogen. Identifying them experimentally requires complex techniques that are expensive and time-consuming. We implemented a Support Vector Machine algorithm to develop a classifier for in silico prediction of prokaryotic essential proteins based on the physico-chemical properties of their amino acid sequences. The classifier was built on a set of 10 physico-chemical descriptor vectors (DVs) and 4 hybrid DVs calculated from amino acid sequences using the PROFEAT and PseAAC servers. It was trained on a data set of 500 known essential and 500 non-essential proteins (n=1,000) and evaluated on an external validation set of 3,462 essential and 5,538 non-essential proteins (n=9,000). The performance of each individual DV set was evaluated. DV set 13, which combines the composition, transition and distribution descriptor set with the hybrid autocorrelation descriptor set, achieved an accuracy of 91.2% in 10-fold cross-validation of the training set and 89.7% on the external validation set, and accuracies of 91.8% and 88.1% on a different yeast protein dataset. Our results indicate that this classification model can be used to identify novel prokaryotic essential proteins.
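
As a rough illustration of what composition and transition descriptors capture, the following Python sketch computes both for a single sequence. The three-class hydrophobicity grouping and the function names are assumptions chosen for illustration; they are not taken from the paper or from the PROFEAT server. The distribution descriptor, the autocorrelation features and the SVM training step are omitted.

```python
from itertools import combinations

# Assumed three-class hydrophobicity grouping (illustrative only; the
# abstract does not state which property classes were used).
GROUPS = {
    "polar": "RKEDQN",
    "neutral": "GASTPHY",
    "hydrophobic": "CLVIMFW",
}
CLASS_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def encode(sequence):
    """Map each residue to its class name, skipping unrecognised residues."""
    return [CLASS_OF[aa] for aa in sequence.upper() if aa in CLASS_OF]


def composition(sequence):
    """Fraction of residues that fall in each class."""
    classes = encode(sequence)
    n = len(classes)
    return {name: (classes.count(name) / n if n else 0.0) for name in GROUPS}


def transition(sequence):
    """Fraction of adjacent residue pairs that switch between two classes."""
    classes = encode(sequence)
    pairs = list(zip(classes, classes[1:]))
    total = len(pairs)
    result = {}
    for a, b in combinations(GROUPS, 2):
        # Count switches in either direction (a -> b or b -> a).
        count = sum(1 for p in pairs if p in ((a, b), (b, a)))
        result[(a, b)] = count / total if total else 0.0
    return result


if __name__ == "__main__":
    seq = "RGLK"  # hypothetical toy sequence
    print(composition(seq))
    print(transition(seq))
```

Concatenating such values across several physico-chemical properties yields a fixed-length feature vector per protein, which is the kind of input a Support Vector Machine classifier requires.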

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acids / analysis
  • Amino Acids / chemistry
  • Bacterial Proteins / chemistry*
  • Bacterial Proteins / genetics
  • Bacterial Proteins / metabolism
  • Computational Biology / methods*
  • Databases, Genetic
  • Fungal Proteins / chemistry*
  • Fungal Proteins / genetics
  • Fungal Proteins / metabolism
  • Models, Chemical*
  • ROC Curve
  • Reproducibility of Results
  • Sequence Analysis, Protein / methods*
  • Support Vector Machine

Substances

  • Amino Acids
  • Bacterial Proteins
  • Fungal Proteins